
Unified input API: artifact_id + data replaces input_csv #207

Merged
RafaelPo merged 8 commits into main from feat/unified-input-api-v2
Feb 24, 2026

Conversation

@RafaelPo
Contributor

Summary

  • Replace input_csv with a unified artifact_id + data input API across all MCP tools (screen, rank, dedupe, merge, agent, single_agent)
  • Add uploads.py with HMAC-signed upload URL support for multi-pod deployments
  • Move input resolution logic to model properties; add Google Sheets /pub URL normalization
  • Reject empty inline data (data=[]) in validators to prevent zero-row task submissions
  • Log input source (artifact_id vs data) on task submission

Test plan

  • All 257 unit tests pass
  • Input validation: artifact_id and data are mutually exclusive, empty data rejected
  • Google Sheets URL normalization handles /pub, /edit, and export URLs
  • HMAC upload URL signing roundtrip, expiry, and tamper detection
  • Manual: run integration tests against staging API

🤖 Generated with Claude Code

RafaelPo and others added 6 commits February 24, 2026 19:46
Processing tools now accept artifact_id (UUID from upload_data) or
data (list[dict]) instead of input_csv/input_data/input_json. Adds
upload_data tool for URL/file ingestion and request_upload_url for
presigned large-file uploads in HTTP mode.

Phase 1: Simplified _SingleSourceInput and MergeInput models
Phase 2: upload_data tool (URL + local path + Google Sheets)
Phase 3: Presigned URL upload system (HMAC, Redis metadata, REST endpoint)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Adopt _aid_or_dataframe and _input_data_mode properties on models,
  removing free functions from tools.py
- Add left/right properties to MergeInput
- Use `is not None` for artifact_id checks
- Fix sync context manager in uploads.py (with → async with)
- Remove duplicate max_inline_rows in config.py
- Update integration tests to use data instead of removed input_csv
- Reject empty CSV in upload_data local file path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- /pub URLs now correctly convert to /export?format=csv
- Headers-only CSV from URL raises clear "empty CSV" error instead
  of misleading "could not parse as CSV or JSON"
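The /pub-to-export conversion described above can be sketched roughly as follows. The function name and regex are hypothetical, not the PR's actual helper, and real Google Sheets URLs have more variants than this handles:

```python
import re


def normalize_sheets_url(url: str) -> str:
    """Sketch: rewrite Google Sheets /pub and /edit URLs to a CSV export URL.

    Assumes the document ID (optionally prefixed with e/ for published
    sheets) immediately follows /spreadsheets/d/.
    """
    m = re.match(r"(https://docs\.google\.com/spreadsheets/d/(?:e/)?[\w-]+)", url)
    if not m:
        return url  # not a Sheets URL; leave it untouched
    return f"{m.group(1)}/export?format=csv"
```

Normalizing up front means the downstream fetcher always receives a CSV endpoint rather than an HTML page.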

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove auto-generated per-process secret — it breaks when pods
don't share state. Now fails fast with a clear error if unset.
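The sign/verify roundtrip, expiry, and tamper detection mentioned in the test plan follow a standard HMAC pattern. A minimal sketch (not the PR's actual `uploads.py` code; the message layout and parameter names are assumptions):

```python
import hashlib
import hmac
import time

# In the PR this comes from configuration and is required in HTTP mode;
# it must be shared across pods, never auto-generated per process.
SECRET = b"upload-secret-from-config"


def sign_upload(upload_id: str, expires_at: int, secret: bytes = SECRET) -> str:
    """Sign an upload id plus expiry timestamp with HMAC-SHA256."""
    msg = f"{upload_id}:{expires_at}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def verify_upload(upload_id: str, expires_at: int, signature: str,
                  secret: bytes = SECRET) -> bool:
    """Reject expired tokens, then compare signatures in constant time."""
    if time.time() > expires_at:
        return False
    expected = sign_upload(upload_id, expires_at, secret)
    return hmac.compare_digest(expected, signature)
```

Because every pod derives the same signature from the shared secret, any pod can verify a URL signed by any other, which is why a per-process secret breaks multi-pod deployments.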

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add len(v) == 0 check to _SingleSourceInput and MergeInput
validate_data_size validators to prevent wasteful zero-row task
submissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@RafaelPo
Contributor Author

@claude code review

@github-actions
Contributor

github-actions bot commented Feb 24, 2026

Claude finished @RafaelPo's task in 2m 12s


Code Review: Unified Input API v2

  • Read repo guidelines
  • Read all changed files and understand the diff
  • Analyze code for bugs, security issues, and improvements
  • Post review findings

Overall this is a well-structured refactor. The artifact_id + data API is cleaner than the previous input_csv + data (CSV string | JSON) approach, and the upload system is solid. Tests cover the important paths. A few issues worth addressing:


1. SSRF risk in fetch_csv_from_url (security)

everyrow-mcp/src/everyrow_mcp/utils.py:63-76

fetch_csv_from_url follows redirects and fetches arbitrary user-supplied URLs with no restrictions on internal/private IPs. In HTTP mode (server-side), this enables SSRF against cloud metadata endpoints (http://169.254.169.254/...), internal services, etc.

Consider adding a deny-list for private IP ranges, or using a library like ssrf-guard to validate resolved addresses before connecting.
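One possible shape for such a deny-list check — resolve the hostname first and reject private, loopback, link-local, and reserved addresses before connecting. This is a sketch of the suggested mitigation, not code from the PR (and note it does not by itself cover redirects, which would need re-checking per hop):

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """Resolve the URL's host and reject non-public addresses.

    Blocks e.g. http://169.254.169.254/... (cloud metadata, link-local)
    and http://127.0.0.1/... (loopback) before any connection is made.
    """
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True
```

Checking the *resolved* addresses (rather than the hostname string) also defends against DNS names that point at internal IPs.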



2. No response size limit when fetching URLs (resource exhaustion)

everyrow-mcp/src/everyrow_mcp/utils.py:74 / everyrow-mcp/src/everyrow_mcp/tools.py:624-625

The upload endpoint enforces max_upload_size_bytes (50 MB), but fetch_csv_from_url reads the full response into memory with no size cap. A large URL response could cause OOM. Consider streaming the response and enforcing a size limit consistent with max_upload_size_bytes.



3. Stale docstring references left_csv / right_csv (bug)

everyrow-mcp/src/everyrow_mcp/tools.py:456-457

The everyrow_merge docstring still says:

left_csv = the table being enriched (ALL its rows appear in the output).
right_csv = the lookup/reference table (its columns are appended to matches).

These should reference the new field names (left_artifact_id/left_data and right_artifact_id/right_data).



4. Misleading upload_secret description (config)

everyrow-mcp/src/everyrow_mcp/config.py:81

Description says "Auto-generated if empty" but _get_secret() in uploads.py:58 raises RuntimeError when empty. The description should say something like "Required in HTTP mode for HMAC signing."



5. Dead code: load_data function (cleanup)

everyrow-mcp/src/everyrow_mcp/utils.py:206-258

load_data is no longer imported in production code (tools.py now uses params._aid_or_dataframe instead). The test comment at test_utils.py:105 says "still used by upload_data" but the upload_data tool uses pd.read_csv / fetch_csv_from_url directly. Consider removing load_data and its tests, or if it's intended as a public utility, update the comment.


6. Internal errors leaked in upload error responses (hardening)

everyrow-mcp/src/everyrow_mcp/uploads.py:204,218

return JSONResponse({"error": f"Could not parse CSV: {exc}"}, status_code=400)
...
return JSONResponse({"error": f"Failed to create artifact: {exc}"}, status_code=500)

Exception messages can leak internal details (stack frames, file paths, DB errors). For the 500 case especially, consider logging the full error and returning a generic message to the client.


7. Minor: _is_url naming convention

everyrow-mcp/src/everyrow_mcp/utils.py:14

_is_url uses a private-by-convention leading underscore but is imported across modules (tools.py, models.py). Consider renaming to is_url to signal it's part of the public API for this package.


Summary

The core design is good — the artifact_id + data API is a clean improvement, validation is thorough, and the HMAC upload system is well-structured with one-use tokens via Redis GETDEL. The issues above are focused on SSRF protection (items 1-2) and correctness (items 3-4), which are the most important to address before merge. Items 5-7 are minor cleanup that can be done at any time.

- Fix misleading upload_secret description: say "Required in HTTP mode"
  instead of "Auto-generated if empty" (it raises RuntimeError)
- Remove unused load_data() and its 8 tests — superseded by model
  properties (_aid_or_dataframe) and upload_data tool
- Rename _is_url → is_url since it's imported across modules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@RafaelPo RafaelPo requested review from hnykda and rgambee February 24, 2026 19:57
```diff
 task_type=PublicTaskType.AGENT,
 session_url=session_url,
-total=len(df),
+total=total,
```

When a task is submitted via artifact_id (not inline data), the row
count is unknown client-side. The fallback labels now say "artifact"
instead of the vague "rows for X" to make clear that the server-side
artifact is being processed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@RafaelPo RafaelPo merged commit 5fd8d45 into main Feb 24, 2026
5 checks passed
@RafaelPo RafaelPo deleted the feat/unified-input-api-v2 branch February 24, 2026 20:05

```python
if is_url(params.source):
    df = await fetch_csv_from_url(params.source)
else:
```

Bug: The everyrow_upload_data function lacks error handling for malformed local CSV files; a parse failure in pandas raises an unhandled exception.
Severity: MEDIUM

Suggested Fix

Wrap the pd.read_csv(params.source) call within a try-except block. Catch potential exceptions from pandas, such as ParserError, and return a user-friendly error message, similar to the error handling implemented for the REST endpoint in uploads.py.
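The suggested wrap could be sketched as follows (the function name is hypothetical; the PR calls `pd.read_csv(params.source)` inline):

```python
import pandas as pd


def read_local_csv(path: str) -> pd.DataFrame:
    """Sketch: surface CSV parse failures as a clear, user-facing error
    instead of letting pandas exceptions propagate out of the tool."""
    try:
        df = pd.read_csv(path)
    except (pd.errors.ParserError, pd.errors.EmptyDataError, UnicodeDecodeError) as exc:
        raise ValueError(f"Could not parse {path} as CSV: {exc}") from exc
    if df.empty:
        # headers-only or zero-row files should be rejected up front
        raise ValueError(f"{path} contains no data rows")
    return df
```

This mirrors the error handling the REST endpoint in uploads.py already performs for uploaded bytes.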

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: everyrow-mcp/src/everyrow_mcp/tools.py#L626

Potential issue: The `everyrow_upload_data` function does not handle potential errors
when parsing a local CSV file using `pd.read_csv`. While the existence of the file is
checked, the content is not validated. If the file is malformed or not a valid CSV,
`pd.read_csv` will raise an unhandled exception, which will propagate up and cause the
tool to fail. This is inconsistent with other parts of the application, such as the REST
endpoint and the URL-based upload functionality, which both include explicit error
handling for CSV parsing.

